19/03/2020

Updates

Welcome to Zoom

Online lecture mode

  • All lectures via Zoom, same time/day as usual.
    • Lectures are recorded (recordings available for 30 days).
    • Preferably no breaks; max. 90 minutes straight.
  • Materials online, as usual.
  • Last two sessions (7 May, 14 May): Q&A session in Zoom instead of presentations/discussion in classroom.

Online examination mode

  • Part I: take-home exercises: No changes. To be handed out on 7 May, to be handed in on 8 June, 16:00.
  • Part II: project presentations: presentations recorded as ‘screencast’ (voice-over-slides).
    • Basically still the same requirements: use Rmd to create slides, presentations of 6-7 minutes max., etc. The only difference is how you deliver your presentation.
    • See here for tips on how to make a screencast.
    • Hand in your presentations by 14 May 2020, 23:59.
    • See assignment in StudyNet/Canvas.

Recap Week 4

Bindings basics

  • Objects/values do not have names; rather, names are bound to objects/values!
  • Objects have a ‘memory address’ (identifier).
x <- c(1, 2, 3)
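The binding concept can be illustrated with base R's tracemem(), which reports the memory address of the object a name is bound to (a minimal sketch, assuming an R build with memory profiling, as in the standard CRAN binaries):

```r
x <- c(1, 2, 3)   # the name x is bound to the vector object
y <- x            # a second name bound to the very same object

# tracemem() reports the object's memory address;
# both names yield the same address, so the assignment made no copy
tracemem(x) == tracemem(y)
untracemem(x)     # stop tracing
```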

Copy-on-modify

  • If we modify values in a vector, actual ‘copying’ is necessary (depending on the data structure of the object…).
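Copy-on-modify can be observed directly with tracemem(), which prints a message at the moment the copy is made (a minimal base R sketch):

```r
x <- c(1, 2, 3)
tracemem(x)       # start tracing copies of this object
y <- x            # binding only, no copy yet
y[3] <- 4         # the modification triggers the copy (tracemem reports it)
x                 # the original vector is unchanged
untracemem(x)     # stop tracing
```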

Data structures and modify-in-place

Improving performance

  • Bottleneck(s) identified, what now?
  • See previous examples for typical problems in a data analytics context.
  • Vast variety of potential bottlenecks. Hard to give general advice.

Programming with Big Data

  1. Which basic (already implemented) R functions are more or less suitable as building blocks for the program?
  2. How can we exploit/avoid some of R’s lower-level characteristics in order to implement efficient functions?
  3. Is there a need to interface with a lower-level programming language in order to speed up the code? (advanced topic)
  • Independent of how we write a statistical procedure in R (or any other language, for that matter): is there an alternative statistical procedure/algorithm that is faster but delivers approximately the same result?

Issues to keep in mind

  • Vectorization.
  • Memory: avoid copying, pre-allocate memory.
  • Use built-in primitive (C) functions (caution: not always faster if the aim is precision).
  • Existing solutions: load additional packages (read.csv() vs. data.table::fread()).
    • Focus of what follows in this course (approach taken in Walkowiak (2016)).
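The first two points can be illustrated with a small base R comparison of growing a vector, pre-allocating memory, and vectorizing (a sketch; timings will vary by machine):

```r
n <- 1e4

# growing a vector: each c() call copies the entire vector built so far
grow <- function(n) {
     out <- numeric(0)
     for (i in 1:n) out <- c(out, sqrt(i))
     out
}

# pre-allocating memory: the output vector is allocated once up front
prealloc <- function(n) {
     out <- numeric(n)
     for (i in 1:n) out[i] <- sqrt(i)
     out
}

# vectorized: the loop runs in compiled C code
vect <- function(n) sqrt(1:n)

system.time(grow(n))
system.time(prealloc(n))
system.time(vect(n))
```

All three return the same result; only the amount of copying and interpreter overhead differs.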

Procedural view and further reading

Goals for today

  1. Know basic strategies for out-of-memory operations in R.
  2. Know basic tools for local big data cleaning and transformation in R.
  3. Understand (in simple terms) how these tools work.
  4. (Recap of virtual memory concept)

Virtual Memory

Virtual memory

  • The operating system allocates part of a mass-storage device (hard disk) as virtual memory.
  • If a process/application uses up too much RAM, the OS starts swapping data between RAM and virtual memory.
  • Processes slow down due to swapping.
  • The default (OS) use of the virtual memory concept is not necessarily optimized for data analysis tasks.

Virtual memory

Virtual memory: example (linux)
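On Linux, the current swap (virtual memory) configuration can be inspected from within R via the kernel's memory-info file (a minimal sketch; Linux only):

```r
# read the kernel's memory statistics and keep the swap-related lines
meminfo <- readLines("/proc/meminfo")
meminfo[grepl("^Swap", meminfo)]
```

SwapTotal and SwapFree show how much of the disk-backed virtual memory is allocated and still available.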

‘Out-of-memory’ strategies

  • Use virtual memory idea for specific data analytics tasks.
  • Two approaches:
    • Chunked data files on disk: partition the large data set, then map and store the chunks of raw data on disk; only the mapping is kept in RAM. (ff-package)
    • Memory-mapped files and shared memory: virtual memory is explicitly allocated for one or several specific data analytics tasks; different processes can access the same memory segment. (bigmemory-package)

Chunking data with the ff-package

Preparations

# SET UP --------------

# install.packages(c("ff", "ffbase"))
# load packages
library(ff)
library(ffbase)
library(pryr)

# create directory for ff chunks, and assign directory to ff
dir.create("ffdf", showWarnings = FALSE)
options(fftempdir = "ffdf")

Chunking data with the ff-package

Import data, inspect change in RAM.

gc()
##             used  (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
## Ncells   1364867  72.9    2521611  134.7         NA   1971957  105.4
## Vcells 122308906 933.2  213476332 1628.7      16384 211148640 1611.0
mem_change(
flights <- 
     read.table.ffdf(file="../data/flights.csv",
                     sep=",",
                     VERBOSE=TRUE,
                     header=TRUE,
                     next.rows=100000,
                     colClasses=NA)
)
## read.table.ffdf 1..100000 (100000)  csv-read=0.811sec ffdf-write=0.13sec
## read.table.ffdf 100001..200000 (100000)  csv-read=0.917sec ffdf-write=0.083sec
## read.table.ffdf 200001..300000 (100000)  csv-read=0.927sec ffdf-write=0.101sec
## read.table.ffdf 300001..336776 (36776)  csv-read=0.413sec ffdf-write=0.061sec
##  csv-read=3.068sec  ffdf-write=0.375sec  TOTAL=3.443sec
## -32.2 MB

Chunking data with the ff-package

Inspect file chunks on disk and data structure in R environment.

# show the files in the directory keeping the chunks
list.files("ffdf")
##   [1] "clone1164340d2f5b.ff" "clone11644bc3efa.ff"  "clone11645998a1fa.ff" "clone116462ca9183.ff"
##   [5] "clone116463e06a3d.ff" "clone1164a770be4.ff"  "clone6bcb2e24c834.ff" "clone6bcb5fea39ff.ff"
##   [9] "clone6bcb6b40b62f.ff" "clone6bcb71c8c692.ff" "clone6dad29195c0d.ff" "clone6dad5e007704.ff"
##  [13] "clone6dad5e113836.ff" "clone6dad5f3f4455.ff" "clone6f3c3cabaafc.ff" "clone6f3c3efb8466.ff"
##  [17] "clone6f3c4a6a8942.ff" "clone6f3c5c127f66.ff" "clone6f3c5f185d2e.ff" "clone6f3c6df29daf.ff"
##  [21] "ff1164225efe05.ff"    "ff11644a0ef699.ff"    "ff11644da01a0f.ff"    "ff116461078291.ff"   
##  [25] "ff6bcb22daf90e.ff"    "ff6bcb2775d020.ff"    "ff6bcb2aecb94d.ff"    "ff6bcb4e400ee.ff"    
##  [29] "ff6dad29ba44cd.ff"    "ff6dad3d49fe9e.ff"    "ff6dad41390084.ff"    "ff6dad70a27f54.ff"   
##  [33] "ff6f3c18256548.ff"    "ff6f3c1856562c.ff"    "ff6f3c71196131.ff"    "ff6f3cba7007a.ff"    
##  [37] "ffdf1164103b1b7.ff"   "ffdf1164112c58ec.ff"  "ffdf116411791fa2.ff"  "ffdf116412474209.ff" 
##  [41] "ffdf11641470007d.ff"  "ffdf116416181227.ff"  "ffdf1164164d0ea9.ff"  "ffdf1164179057a4.ff" 
##  [45] "ffdf116417a90f42.ff"  "ffdf116418d3c488.ff"  "ffdf1164198865e6.ff"  "ffdf11641aee92cf.ff" 
##  [49] "ffdf11641b6b46b.ff"   "ffdf11641c5fdccb.ff"  "ffdf11641cfd84af.ff"  "ffdf11641d46099f.ff" 
##  [53] "ffdf11641ebb89c0.ff"  "ffdf11641feddae.ff"   "ffdf11641ff8b447.ff"  "ffdf11642115504c.ff" 
##  [57] "ffdf116421dd20f.ff"   "ffdf116422887bc9.ff"  "ffdf11642357555b.ff"  "ffdf116424e861d9.ff" 
##  [61] "ffdf116425d63983.ff"  "ffdf116426d6da14.ff"  "ffdf116426ded3d3.ff"  "ffdf1164270bcda4.ff" 
##  [65] "ffdf116428290a.ff"    "ffdf1164296546d7.ff"  "ffdf11642a50ac03.ff"  "ffdf11642a6816ae.ff" 
##  [69] "ffdf11642b211778.ff"  "ffdf11642c468bee.ff"  "ffdf11642d7c2ec2.ff"  "ffdf11642e76dad5.ff" 
##  [73] "ffdf11642f256b58.ff"  "ffdf116430a79f0a.ff"  "ffdf1164323039df.ff"  "ffdf116433b2520a.ff" 
##  [77] "ffdf116433e1d9e2.ff"  "ffdf1164380de27c.ff"  "ffdf116438707e8.ff"   "ffdf116438bcb88.ff"  
##  [81] "ffdf11643aa4db7d.ff"  "ffdf11643b6d019a.ff"  "ffdf11643d749a16.ff"  "ffdf116441aa7696.ff" 
##  [85] "ffdf1164440663f.ff"   "ffdf1164442e095a.ff"  "ffdf116445301906.ff"  "ffdf1164462b54ac.ff" 
##  [89] "ffdf116446c40c31.ff"  "ffdf1164480a0ae9.ff"  "ffdf116448120c07.ff"  "ffdf116448492a21.ff" 
##  [93] "ffdf1164487b6b35.ff"  "ffdf116449d9ec22.ff"  "ffdf11644b8cccf7.ff"  "ffdf11644c9e4f9a.ff" 
##  [97] "ffdf11644cba627a.ff"  "ffdf11644e26990c.ff"  "ffdf11644e6f7756.ff"  "ffdf11644edf4f89.ff" 
## [101] "ffdf11644f7fb12d.ff"  "ffdf116453978601.ff"  "ffdf11645666f656.ff"  "ffdf1164569dc12.ff"  
## [105] "ffdf116458bafc66.ff"  "ffdf11645c10ba0c.ff"  "ffdf11645c28c230.ff"  "ffdf11645c5795ca.ff" 
## [109] "ffdf11645d23d873.ff"  "ffdf11645e9d060a.ff"  "ffdf11646047e966.ff"  "ffdf116460975ef0.ff" 
## [113] "ffdf116461176220.ff"  "ffdf11646121b4bc.ff"  "ffdf116463876af7.ff"  "ffdf116464fc9c6.ff"  
## [117] "ffdf116468ef2484.ff"  "ffdf1164698a2525.ff"  "ffdf11646be0d475.ff"  "ffdf11646bf02837.ff" 
## [121] "ffdf11646c8cd644.ff"  "ffdf11646fa2e3fa.ff"  "ffdf116470fc8942.ff"  "ffdf1164718443f0.ff" 
## [125] "ffdf116471da101a.ff"  "ffdf116473c9d7de.ff"  "ffdf1164759c84a7.ff"  "ffdf11647681f6c2.ff" 
## [129] "ffdf116476c92c43.ff"  "ffdf1164770bc973.ff"  "ffdf116477e50c94.ff"  "ffdf116477e9de02.ff" 
## [133] "ffdf11647803d754.ff"  "ffdf11647a2b3022.ff"  "ffdf11647c277936.ff"  "ffdf11647d18edc7.ff" 
## [137] "ffdf11647e4abe8b.ff"  "ffdf11648e492e1.ff"   "ffdf116498bdd67.ff"   "ffdf1164b526bf2.ff"  
## [141] "ffdf1164ded687.ff"    "ffdf127216b9bb88.ff"  "ffdf127217636cb4.ff"  "ffdf1272178c3c41.ff" 
## [145] "ffdf12721e1bc6bf.ff"  "ffdf12722077a43e.ff"  "ffdf127225c62da9.ff"  "ffdf1272289e32a1.ff" 
## [149] "ffdf1272290e795.ff"   "ffdf12722a0dfcdc.ff"  "ffdf12722cc5b009.ff"  "ffdf12722d66679a.ff" 
## [153] "ffdf12722f14f19f.ff"  "ffdf12723195350a.ff"  "ffdf1272329cb694.ff"  "ffdf127236274487.ff" 
## [157] "ffdf12723acd3cf4.ff"  "ffdf127240fb9e72.ff"  "ffdf127243d2793a.ff"  "ffdf1272475f6fb2.ff" 
## [161] "ffdf12724c0416d7.ff"  "ffdf12724e96c881.ff"  "ffdf12725b8c5cc8.ff"  "ffdf12725bf0f5df.ff" 
## [165] "ffdf12725c1f7d6c.ff"  "ffdf1272621a6405.ff"  "ffdf127266a835d5.ff"  "ffdf127267e50577.ff" 
## [169] "ffdf12726cc9fcea.ff"  "ffdf127275a96b.ff"    "ffdf127275dfc59e.ff"  "ffdf12727753e883.ff" 
## [173] "ffdf127277c7db7a.ff"  "ffdf1272785cd554.ff"  "ffdf12727bc6e95f.ff"  "ffdf12727c53e515.ff" 
## [177] "ffdf127280315df.ff"   "ffdf1272d45b576.ff"   "ffdf1272ebff319.ff"   "ffdf6bcb1356d7f1.ff" 
## [181] "ffdf6bcb13a42184.ff"  "ffdf6bcb1564a06e.ff"  "ffdf6bcb161a8eb6.ff"  "ffdf6bcb17520ac1.ff" 
## [185] "ffdf6bcb184fd221.ff"  "ffdf6bcb1b3f3099.ff"  "ffdf6bcb1bc0f817.ff"  "ffdf6bcb1cd086ef.ff" 
## [189] "ffdf6bcb1fbee15.ff"   "ffdf6bcb2291cf31.ff"  "ffdf6bcb23d7a6ab.ff"  "ffdf6bcb263fb291.ff" 
## [193] "ffdf6bcb299ae13d.ff"  "ffdf6bcb29c5160a.ff"  "ffdf6bcb2c22fc3d.ff"  "ffdf6bcb2cd61c9b.ff" 
## [197] "ffdf6bcb300c7a36.ff"  "ffdf6bcb320b6ac8.ff"  "ffdf6bcb342291b0.ff"  "ffdf6bcb362ad003.ff" 
## [201] "ffdf6bcb3752e308.ff"  "ffdf6bcb38aa7af7.ff"  "ffdf6bcb3aaf2c7b.ff"  "ffdf6bcb3b19f1a4.ff" 
## [205] "ffdf6bcb3cf80612.ff"  "ffdf6bcb4d5b5f23.ff"  "ffdf6bcb4f08e3ac.ff"  "ffdf6bcb516a009c.ff" 
## [209] "ffdf6bcb51b02d0d.ff"  "ffdf6bcb52211d0c.ff"  "ffdf6bcb53e95040.ff"  "ffdf6bcb54ec7ff9.ff" 
## [213] "ffdf6bcb553d0bbd.ff"  "ffdf6bcb5763c322.ff"  "ffdf6bcb57e9de3e.ff"  "ffdf6bcb5dda183.ff"  
## [217] "ffdf6bcb5e41023c.ff"  "ffdf6bcb61e62814.ff"  "ffdf6bcb65320fbd.ff"  "ffdf6bcb66d228b9.ff" 
## [221] "ffdf6bcb684357ea.ff"  "ffdf6bcb69a03e97.ff"  "ffdf6bcb69f783e5.ff"  "ffdf6bcb6a35e902.ff" 
## [225] "ffdf6bcb6ea1cfc.ff"   "ffdf6bcb711a97c3.ff"  "ffdf6bcb74dfb866.ff"  "ffdf6bcb7690d45d.ff" 
## [229] "ffdf6bcb7abd59e9.ff"  "ffdf6bcb7bab3216.ff"  "ffdf6bcb7f953ad7.ff"  "ffdf6bcba147b94.ff"  
## [233] "ffdf6bcba5503df.ff"   "ffdf6bcbaa9cf64.ff"   "ffdf6bcbb49cf26.ff"   "ffdf6bcbb5024da.ff"  
## [237] "ffdf6bcbc8feec5.ff"   "ffdf6bcbf3904ce.ff"   "ffdf6bcbff39d7d.ff"   "ffdf6dad1091932b.ff" 
## [241] "ffdf6dad11e3d015.ff"  "ffdf6dad121a547e.ff"  "ffdf6dad131a8c55.ff"  "ffdf6dad1980acfb.ff" 
## [245] "ffdf6dad1a3c385e.ff"  "ffdf6dad1d351203.ff"  "ffdf6dad1e783006.ff"  "ffdf6dad1fc7fd16.ff" 
## [249] "ffdf6dad207d2be6.ff"  "ffdf6dad2206a194.ff"  "ffdf6dad2254d2f9.ff"  "ffdf6dad23140c14.ff" 
## [253] "ffdf6dad23a93daf.ff"  "ffdf6dad245c0e8e.ff"  "ffdf6dad2493678d.ff"  "ffdf6dad24cd2e33.ff" 
## [257] "ffdf6dad2541aaa8.ff"  "ffdf6dad2881a63a.ff"  "ffdf6dad29f2d846.ff"  "ffdf6dad2b940fb.ff"  
## [261] "ffdf6dad2f43808b.ff"  "ffdf6dad2fd5e3a0.ff"  "ffdf6dad31268d73.ff"  "ffdf6dad31b13798.ff" 
## [265] "ffdf6dad332bbf47.ff"  "ffdf6dad33a07a74.ff"  "ffdf6dad33db4800.ff"  "ffdf6dad38d9be9e.ff" 
## [269] "ffdf6dad3afd2c22.ff"  "ffdf6dad3d20bbd9.ff"  "ffdf6dad3fa90da7.ff"  "ffdf6dad4040b2e4.ff" 
## [273] "ffdf6dad415c825e.ff"  "ffdf6dad42e027e7.ff"  "ffdf6dad4474157e.ff"  "ffdf6dad45451ccd.ff" 
## [277] "ffdf6dad46e83a87.ff"  "ffdf6dad481c4d07.ff"  "ffdf6dad4c5f3d4.ff"   "ffdf6dad4dc29598.ff" 
## [281] "ffdf6dad505311d.ff"   "ffdf6dad50aacf91.ff"  "ffdf6dad5319234f.ff"  "ffdf6dad5434be0c.ff" 
## [285] "ffdf6dad54bea621.ff"  "ffdf6dad55e35cb4.ff"  "ffdf6dad55fab43c.ff"  "ffdf6dad56d9b4f7.ff" 
## [289] "ffdf6dad575cf437.ff"  "ffdf6dad5779cdb2.ff"  "ffdf6dad5c3fc187.ff"  "ffdf6dad5d75c4e7.ff" 
## [293] "ffdf6dad60e0fa52.ff"  "ffdf6dad617cca67.ff"  "ffdf6dad623348c0.ff"  "ffdf6dad67434294.ff" 
## [297] "ffdf6dad67e05b88.ff"  "ffdf6dad6a81e948.ff"  "ffdf6dad6b61ce21.ff"  "ffdf6dad6ea24c66.ff" 
## [301] "ffdf6dad6ecbf44e.ff"  "ffdf6dad713265fe.ff"  "ffdf6dad719c5206.ff"  "ffdf6dad74f579b6.ff" 
## [305] "ffdf6dad754fe702.ff"  "ffdf6dad7660889b.ff"  "ffdf6dad76b511a0.ff"  "ffdf6dad78e59742.ff" 
## [309] "ffdf6dad798bd454.ff"  "ffdf6dad79fab007.ff"  "ffdf6dad7ac6bbd6.ff"  "ffdf6dad7b8765db.ff" 
## [313] "ffdf6dad7cb3f102.ff"  "ffdf6dad851401a.ff"   "ffdf6dad9d18f4.ff"    "ffdf6dad9e86fb9.ff"  
## [317] "ffdf6dadf04dfe7.ff"   "ffdf6dadfaa9604.ff"   "ffdf6f3c1060605e.ff"  "ffdf6f3c157f18c7.ff" 
## [321] "ffdf6f3c17ff68b3.ff"  "ffdf6f3c1b84d496.ff"  "ffdf6f3c1c8614e3.ff"  "ffdf6f3c1ee045c0.ff" 
## [325] "ffdf6f3c2057655f.ff"  "ffdf6f3c2059f1d8.ff"  "ffdf6f3c22409e9c.ff"  "ffdf6f3c22dd8773.ff" 
## [329] "ffdf6f3c24bfaf8a.ff"  "ffdf6f3c267648ba.ff"  "ffdf6f3c26b0e7ec.ff"  "ffdf6f3c26f24f89.ff" 
## [333] "ffdf6f3c284c8267.ff"  "ffdf6f3c293d46e.ff"   "ffdf6f3c2ca6a8d2.ff"  "ffdf6f3c305a1b33.ff" 
## [337] "ffdf6f3c32a0fefa.ff"  "ffdf6f3c32cbaf6.ff"   "ffdf6f3c330c3f11.ff"  "ffdf6f3c35a0025.ff"  
## [341] "ffdf6f3c35d2a054.ff"  "ffdf6f3c36904c38.ff"  "ffdf6f3c3c194c5b.ff"  "ffdf6f3c3d1c9282.ff" 
## [345] "ffdf6f3c3dcc02f.ff"   "ffdf6f3c3ebd43b4.ff"  "ffdf6f3c3f712a91.ff"  "ffdf6f3c41008f7c.ff" 
## [349] "ffdf6f3c441316b2.ff"  "ffdf6f3c4519a162.ff"  "ffdf6f3c46f2f7e6.ff"  "ffdf6f3c48465c58.ff" 
## [353] "ffdf6f3c4b182b0b.ff"  "ffdf6f3c4ba05c7d.ff"  "ffdf6f3c4c0f64ff.ff"  "ffdf6f3c4c208043.ff" 
## [357] "ffdf6f3c4d94df71.ff"  "ffdf6f3c4fe602b4.ff"  "ffdf6f3c51a31caa.ff"  "ffdf6f3c5453969e.ff" 
## [361] "ffdf6f3c560dee36.ff"  "ffdf6f3c5654cd97.ff"  "ffdf6f3c56c76f9a.ff"  "ffdf6f3c59122173.ff" 
## [365] "ffdf6f3c59512811.ff"  "ffdf6f3c5b108b6f.ff"  "ffdf6f3c5c7b7e40.ff"  "ffdf6f3c5dae6140.ff" 
## [369] "ffdf6f3c609743d3.ff"  "ffdf6f3c60992b96.ff"  "ffdf6f3c60a412b7.ff"  "ffdf6f3c641946fd.ff" 
## [373] "ffdf6f3c6704e3e0.ff"  "ffdf6f3c68039b52.ff"  "ffdf6f3c683c58e0.ff"  "ffdf6f3c69c68ace.ff" 
## [377] "ffdf6f3c6c6321b5.ff"  "ffdf6f3c6da3dfd3.ff"  "ffdf6f3c70bcd5df.ff"  "ffdf6f3c70f7450.ff"  
## [381] "ffdf6f3c724c4430.ff"  "ffdf6f3c7350aa4e.ff"  "ffdf6f3c768958a7.ff"  "ffdf6f3c76ad3fe7.ff" 
## [385] "ffdf6f3c786e0ebc.ff"  "ffdf6f3c79a88d70.ff"  "ffdf6f3c7a6618d6.ff"  "ffdf6f3c7b529b69.ff" 
## [389] "ffdf6f3c7bd3ed1f.ff"  "ffdf6f3c7dbcb437.ff"  "ffdf6f3c7dcd6001.ff"  "ffdf6f3c8b93a8d.ff"  
## [393] "ffdf6f3c92bf781.ff"   "ffdf6f3c96c3ca7.ff"   "ffdf6f3c9f033d7.ff"   "ffdf6f3ca2636f2.ff"  
## [397] "ffdf6f3caca2ee9.ff"
# investigate the structure of the object created in the R environment
summary(flights)
##                Length Class     Mode
## year           336776 ff_vector list
## month          336776 ff_vector list
## day            336776 ff_vector list
## dep_time       336776 ff_vector list
## sched_dep_time 336776 ff_vector list
## dep_delay      336776 ff_vector list
## arr_time       336776 ff_vector list
## sched_arr_time 336776 ff_vector list
## arr_delay      336776 ff_vector list
## carrier        336776 ff_vector list
## flight         336776 ff_vector list
## tailnum        336776 ff_vector list
## origin         336776 ff_vector list
## dest           336776 ff_vector list
## air_time       336776 ff_vector list
## distance       336776 ff_vector list
## hour           336776 ff_vector list
## minute         336776 ff_vector list
## time_hour      336776 ff_vector list

Memory mapping with bigmemory

Preparations

# SET UP ----------------

# load packages
library(bigmemory)
library(biganalytics)

Memory mapping with bigmemory

Import data, inspect change in RAM.

# import the data
flights <- read.big.matrix("../data/flights.csv",
                     type="integer",
                     header=TRUE,
                     backingfile="flights.bin",
                     descriptorfile="flights.desc")

Memory mapping with bigmemory

Inspect the imported data.

summary(flights)
##                          min           max          mean           NAs
## year             2013.000000   2013.000000   2013.000000      0.000000
## month               1.000000     12.000000      6.548510      0.000000
## day                 1.000000     31.000000     15.710787      0.000000
## dep_time            1.000000   2400.000000   1349.109947   8255.000000
## sched_dep_time    106.000000   2359.000000   1344.254840      0.000000
## dep_delay         -43.000000   1301.000000     12.639070   8255.000000
## arr_time            1.000000   2400.000000   1502.054999   8713.000000
## sched_arr_time      1.000000   2359.000000   1536.380220      0.000000
## arr_delay         -86.000000   1272.000000      6.895377   9430.000000
## carrier             9.000000      9.000000      9.000000 318316.000000
## flight              1.000000   8500.000000   1971.923620      0.000000
## tailnum                                                  336776.000000
## origin                                                   336776.000000
## dest                                                     336776.000000
## air_time           20.000000    695.000000    150.686460   9430.000000
## distance           17.000000   4983.000000   1039.912604      0.000000
## hour                1.000000     23.000000     13.180247      0.000000
## minute              0.000000     59.000000     26.230100      0.000000
## time_hour        2013.000000   2014.000000   2013.000261      0.000000

Memory mapping with bigmemory

Inspect the object loaded into the R environment.

flights
## An object of class "big.matrix"
## Slot "address":
## <pointer: 0x7fa0d52d3c70>

Memory mapping with bigmemory

  • backingfile: The cache for the imported file (holds the raw data on disk).
  • descriptorfile: Metadata describing the imported data set (also on disk).

Memory mapping with bigmemory

Understanding the role of backingfile and descriptorfile.

First, import a large data set without a backing-file:

# import data and check time needed  
system.time(
     flights1 <- read.big.matrix("../data/flights.csv",
                                 header = TRUE,
                                 sep = ",",
                                 type = "integer")
)
##    user  system elapsed 
##   2.121   0.045   2.218
# import data and check memory used
mem_change(
     flights1 <- read.big.matrix("../data/flights.csv",
                                 header = TRUE,
                                 sep = ",",
                                 type = "integer")
)
## 528 B
flights1 
## An object of class "big.matrix"
## Slot "address":
## <pointer: 0x7fa0d1cd4f90>

Memory mapping with bigmemory

Understanding the role of backingfile and descriptorfile.

Second, import the same data set with a backing-file:

# import data and check time needed  
system.time(
     flights2 <- read.big.matrix("../data/flights.csv",
                                 header = TRUE,
                                 sep = ",",
                                 type = "integer",
                                 backingfile = "flights2.bin",
                                 descriptorfile = "flights2.desc"
                                 )
)
##    user  system elapsed 
##   2.275   0.096   2.503
# import data and check memory used
mem_change(
     flights2 <- read.big.matrix("../data/flights.csv",
                                 header = TRUE,
                                 sep = ",",
                                 type = "integer",
                                 backingfile = "flights2.bin",
                                 descriptorfile = "flights2.desc"
                                 )
)
## 584 B
flights2
## An object of class "big.matrix"
## Slot "address":
## <pointer: 0x7fa0d1f4b010>

Memory mapping with bigmemory

Understanding the role of backingfile and descriptorfile.

Third, re-load the same data set via its backing-file (attaching instead of re-importing).

# remove the loaded file
rm(flights2)

# 'load' it via the backing-file
system.time(flights2 <- attach.big.matrix("flights2.desc"))
##    user  system elapsed 
##   0.001   0.001   0.001
flights2
## An object of class "big.matrix"
## Slot "address":
## <pointer: 0x7fa0d5502c30>

Cleaning and Transformation

Typical tasks (independent of data set size)

  • Normalize/standardize.
  • Code additional variables (indicators, strings to categorical, etc.).
  • Remove, add covariates.
  • Merge data sets.
  • Set data types.
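For in-memory data, these tasks look as follows in base R (a minimal sketch on made-up toy data):

```r
# toy data set (hypothetical values, for illustration only)
df <- data.frame(delay   = c(5, -3, 12, 0),
                 carrier = c("AA", "DL", "AA", "UA"),
                 stringsAsFactors = FALSE)

# normalize/standardize
df$delay_std <- (df$delay - mean(df$delay)) / sd(df$delay)

# code additional variables: indicator, string to categorical
df$delayed <- as.integer(df$delay > 0)
df$carrier <- as.factor(df$carrier)

# remove a covariate
df$delay <- NULL

str(df)
```

With large data sets, the same conceptual steps apply; the challenge is carrying them out without exhausting RAM.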

Typical workflow

  1. Import raw data.
  2. Clean/transform.
  3. Store for analysis.
    • Write to file.
    • Write to database.

Bottlenecks

  • RAM:
    • Raw data does not fit into memory.
    • Transformations enlarge RAM allocation (copying).
  • Mass Storage: Reading/Writing
  • CPU: Parsing (data types)

Data Preparation with ff

Set up

The following examples are based on Walkowiak (2016), Chapter 3.

## SET UP ------------------------

# set the working directory to the folder containing the data and airline_id files
# setwd("materials/code_book/B05396_Ch03_Code")
dir.create("ffdf", showWarnings = FALSE)
options(fftempdir = "ffdf")

# load packages
library(ff)
library(ffbase)
library(pryr)

# fix vars
FLIGHTS_DATA <- "../code_book/B05396_Ch03_Code/flights_sep_oct15.txt"
AIRLINES_DATA <- "../code_book/B05396_Ch03_Code/airline_id.csv"

Data import

# DATA IMPORT ------------------

# 1. Import flights_sep_oct15.txt and airline_id.csv from the flat files. 

system.time(flights.ff <- read.table.ffdf(file=FLIGHTS_DATA,
                                          sep=",",
                                          VERBOSE=TRUE,
                                          header=TRUE,
                                          next.rows=100000,
                                          colClasses=NA))
## read.table.ffdf 1..100000 (100000)  csv-read=0.861sec ffdf-write=0.206sec
## read.table.ffdf 100001..200000 (100000)  csv-read=1.07sec ffdf-write=0.136sec
## read.table.ffdf 200001..300000 (100000)  csv-read=1.207sec ffdf-write=0.165sec
## read.table.ffdf 300001..400000 (100000)  csv-read=1.08sec ffdf-write=0.123sec
## read.table.ffdf 400001..500000 (100000)  csv-read=1.054sec ffdf-write=0.12sec
## read.table.ffdf 500001..600000 (100000)  csv-read=1.225sec ffdf-write=0.127sec
## read.table.ffdf 600001..700000 (100000)  csv-read=1.394sec ffdf-write=0.123sec
## read.table.ffdf 700001..800000 (100000)  csv-read=1.628sec ffdf-write=0.144sec
## read.table.ffdf 800001..900000 (100000)  csv-read=1.476sec ffdf-write=0.11sec
## read.table.ffdf 900001..951111 (51111)  csv-read=0.784sec ffdf-write=0.105sec
##  csv-read=11.779sec  ffdf-write=1.359sec  TOTAL=13.138sec
##    user  system elapsed 
##  10.890   0.897  13.143
airlines.ff <- read.csv.ffdf(file= AIRLINES_DATA,
                             VERBOSE=TRUE,
                             header=TRUE,
                             next.rows=100000,
                             colClasses=NA)
## read.table.ffdf 1..1607 (1607)  csv-read=0.011sec ffdf-write=0.009sec
##  csv-read=0.011sec  ffdf-write=0.009sec  TOTAL=0.02sec
# check memory used
mem_used()
## 1,022,326,912 B

Comparison with read.table

## Using read.table()
system.time(flights.table <- read.table(FLIGHTS_DATA, 
                                        sep=",",
                                        header=TRUE))
##    user  system elapsed 
##   8.790   0.506   9.670
gc()
##             used   (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
## Ncells   1361877   72.8    2521611  134.7         NA   2518811  134.6
## Vcells 131579857 1003.9  213476332 1628.7      16384 213466427 1628.7
system.time(airlines.table <- read.csv(AIRLINES_DATA,
                                       header = TRUE))
##    user  system elapsed 
##   0.006   0.000   0.007
# check memory used
mem_used()
## 1,128,933,880 B

Inspect imported files

# 2. Inspect the ffdf objects.
## For flights.ff object:
class(flights.ff)
## [1] "ffdf"
dim(flights.ff)
## [1] 951111     28
## For airlines.ff object:
class(airlines.ff)
## [1] "ffdf"
dim(airlines.ff)
## [1] 1607    2

Data cleaning and transformation

Goal: merge airline data to flights data

# step 1: 
## Rename "Code" variable from airlines.ff to "AIRLINE_ID" and "Description" into "AIRLINE_NM".
names(airlines.ff) <- c("AIRLINE_ID", "AIRLINE_NM")
names(airlines.ff)
## [1] "AIRLINE_ID" "AIRLINE_NM"
str(airlines.ff[1:20,])
## 'data.frame':    20 obs. of  2 variables:
##  $ AIRLINE_ID: int  19031 19032 19033 19034 19035 19036 19037 19038 19039 19040 ...
##  $ AIRLINE_NM: Factor w/ 1607 levels "40-Mile Air: Q5",..: 945 1025 503 721 64 725 1194 99 1395 276 ...

Data cleaning and transformation

Goal: merge airline data to flights data

# merge of ffdf objects
mem_change(flights.data.ff <- merge.ffdf(flights.ff, airlines.ff, by="AIRLINE_ID"))
## 656 kB
class(flights.data.ff)
## [1] "ffdf"
dim(flights.data.ff)
## [1] 951111     29
dimnames.ffdf(flights.data.ff)
## [[1]]
## NULL
## 
## [[2]]
##  [1] "YEAR"              "MONTH"             "DAY_OF_MONTH"      "DAY_OF_WEEK"      
##  [5] "FL_DATE"           "UNIQUE_CARRIER"    "AIRLINE_ID"        "TAIL_NUM"         
##  [9] "FL_NUM"            "ORIGIN_AIRPORT_ID" "ORIGIN"            "ORIGIN_CITY_NAME" 
## [13] "ORIGIN_STATE_NM"   "ORIGIN_WAC"        "DEST_AIRPORT_ID"   "DEST"             
## [17] "DEST_CITY_NAME"    "DEST_STATE_NM"     "DEST_WAC"          "DEP_TIME"         
## [21] "DEP_DELAY"         "ARR_TIME"          "ARR_DELAY"         "CANCELLED"        
## [25] "CANCELLATION_CODE" "DIVERTED"          "AIR_TIME"          "DISTANCE"         
## [29] "AIRLINE_NM"

Inspect difference to in-memory operation

## For the in-memory objects (flights.table, airlines.table):
names(airlines.table) <- c("AIRLINE_ID", "AIRLINE_NM")
names(airlines.table)
## [1] "AIRLINE_ID" "AIRLINE_NM"
str(airlines.table[1:20,])
## 'data.frame':    20 obs. of  2 variables:
##  $ AIRLINE_ID: int  19031 19032 19033 19034 19035 19036 19037 19038 19039 19040 ...
##  $ AIRLINE_NM: Factor w/ 1607 levels "40-Mile Air: Q5",..: 945 1025 503 721 64 725 1194 99 1395 276 ...
# check memory usage of merge in RAM 
mem_change(flights.data.table <- merge(flights.table,
                                       airlines.table,
                                       by="AIRLINE_ID"))
## 118 MB

Type conversion: ff factor

# Inspect the current variable
table.ff(flights.data.ff$DAY_OF_WEEK)
## 
##      1      2      3      4      5      6      7 
## 131267 143057 145057 148544 148159 112239 122788
head(flights.data.ff$DAY_OF_WEEK)
## [1] 1 1 1 1 1 1
# Convert the numeric ff DAY_OF_WEEK vector to an ff factor:
flights.data.ff$WEEKDAY <- cut.ff(flights.data.ff$DAY_OF_WEEK, 
                                   breaks = 7, 
                                   labels = c("Monday", "Tuesday", 
                                              "Wednesday", "Thursday", 
                                              "Friday", "Saturday",
                                              "Sunday"))
# inspect the result
head(flights.data.ff$WEEKDAY)
## [1] Monday Monday Monday Monday Monday Monday
## Levels: Monday Tuesday Wednesday Thursday Friday Saturday Sunday
table.ff(flights.data.ff$WEEKDAY)
## 
##    Monday   Tuesday Wednesday  Thursday    Friday  Saturday    Sunday 
##    131267    143057    145057    148544    148159    112239    122788

Subsetting

mem_used()
## 1,247,893,104 B
# Subset the ffdf object flights.data.ff:
subs1.ff <- subset.ffdf(flights.data.ff, CANCELLED == 1, 
                        select = c(FL_DATE, AIRLINE_ID, 
                                   ORIGIN_CITY_NAME,
                                   ORIGIN_STATE_NM,
                                   DEST_CITY_NAME,
                                   DEST_STATE_NM,
                                   CANCELLATION_CODE))

dim(subs1.ff)
## [1] 4529    7
mem_used()
## 1,248,120,536 B

Save to ffdf-files

(For further processing with ff)

# Save a newly created ffdf object to a data file:

save.ffdf(subs1.ff, overwrite = TRUE) # 7 files (one per column) are created in the ffdb directory

Load ffdf-files

# Loading previously saved ffdf files:
rm(subs1.ff)
gc()
##             used   (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
## Ncells   1384197   74.0    3261861  174.3         NA   3261861  174.3
## Vcells 146342907 1116.6  213476332 1628.7      16384 213466427 1628.7
load.ffdf("ffdb")
str(subs1.ff)
## List of 3
##  $ virtual: 'data.frame':    7 obs. of  7 variables:
##  .. $ VirtualVmode     : chr  "integer" "integer" "integer" "integer" ...
##  .. $ AsIs             : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  .. $ VirtualIsMatrix  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  .. $ PhysicalIsMatrix : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  .. $ PhysicalElementNo: int  1 2 3 4 5 6 7
##  .. $ PhysicalFirstCol : int  1 1 1 1 1 1 1
##  .. $ PhysicalLastCol  : int  1 1 1 1 1 1 1
##  .. - attr(*, "Dim")= int  4529 7
##  .. - attr(*, "Dimorder")= int  1 2
##  $ physical: List of 7
##  .. $ FL_DATE          : list()
##  ..  ..- attr(*, "physical")=Class 'ff_pointer' <externalptr> 
##  ..  .. ..- attr(*, "vmode")= chr "integer"
##  ..  .. ..- attr(*, "maxlength")= int 4529
##  ..  .. ..- attr(*, "pattern")= chr "ffdf"
##  ..  .. ..- attr(*, "filename")= chr "/Users/umatter/Dropbox/Teaching/HSG/BigData/BigData/materials/slides/ffdb/subs1.ff$FL_DATE.ff"
##  ..  .. ..- attr(*, "pagesize")= int 65536
##  ..  .. ..- attr(*, "finalizer")= chr "close"
##  ..  .. ..- attr(*, "finonexit")= logi TRUE
##  ..  .. ..- attr(*, "readonly")= logi FALSE
##  ..  .. ..- attr(*, "caching")= chr "mmnoflush"
##  ..  ..- attr(*, "virtual")= list()
##  ..  .. ..- attr(*, "Length")= int 4529
##  ..  .. ..- attr(*, "Symmetric")= logi FALSE
##  ..  .. ..- attr(*, "Levels")= chr [1:61] "2015-09-01" "2015-09-02" "2015-09-03" "2015-09-04" ...
##  ..  .. ..- attr(*, "ramclass")= chr "factor"
##  .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"
##  .. $ AIRLINE_ID       : list()
##  ..  ..- attr(*, "physical")=Class 'ff_pointer' <externalptr> 
##  ..  .. ..- attr(*, "vmode")= chr "integer"
##  ..  .. ..- attr(*, "maxlength")= int 4529
##  ..  .. ..- attr(*, "pattern")= chr "ffdf"
##  ..  .. ..- attr(*, "filename")= chr "/Users/umatter/Dropbox/Teaching/HSG/BigData/BigData/materials/slides/ffdb/subs1.ff$AIRLINE_ID.ff"
##  ..  .. ..- attr(*, "pagesize")= int 65536
##  ..  .. ..- attr(*, "finalizer")= chr "close"
##  ..  .. ..- attr(*, "finonexit")= logi TRUE
##  ..  .. ..- attr(*, "readonly")= logi FALSE
##  ..  .. ..- attr(*, "caching")= chr "mmnoflush"
##  ..  ..- attr(*, "virtual")= list()
##  ..  .. ..- attr(*, "Length")= int 4529
##  ..  .. ..- attr(*, "Symmetric")= logi FALSE
##  .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"
##  .. $ ORIGIN_CITY_NAME : list()
##  ..  ..- attr(*, "physical")=Class 'ff_pointer' <externalptr> 
##  ..  .. ..- attr(*, "vmode")= chr "integer"
##  ..  .. ..- attr(*, "maxlength")= int 4529
##  ..  .. ..- attr(*, "pattern")= chr "ffdf"
##  ..  .. ..- attr(*, "filename")= chr "/Users/umatter/Dropbox/Teaching/HSG/BigData/BigData/materials/slides/ffdb/subs1.ff$ORIGIN_CITY_NAME.ff"
##  ..  .. ..- attr(*, "pagesize")= int 65536
##  ..  .. ..- attr(*, "finalizer")= chr "close"
##  ..  .. ..- attr(*, "finonexit")= logi TRUE
##  ..  .. ..- attr(*, "readonly")= logi FALSE
##  ..  .. ..- attr(*, "caching")= chr "mmnoflush"
##  ..  ..- attr(*, "virtual")= list()
##  ..  .. ..- attr(*, "Length")= int 4529
##  ..  .. ..- attr(*, "Symmetric")= logi FALSE
##  ..  .. ..- attr(*, "Levels")= chr [1:305] "Abilene, TX" "Akron, OH" "Albany, GA" "Albany, NY" ...
##  ..  .. ..- attr(*, "ramclass")= chr "factor"
##  .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"
##  .. $ ORIGIN_STATE_NM  : list()
##  ..  ..- attr(*, "physical")=Class 'ff_pointer' <externalptr> 
##  ..  .. ..- attr(*, "vmode")= chr "integer"
##  ..  .. ..- attr(*, "maxlength")= int 4529
##  ..  .. ..- attr(*, "pattern")= chr "ffdf"
##  ..  .. ..- attr(*, "filename")= chr "/Users/umatter/Dropbox/Teaching/HSG/BigData/BigData/materials/slides/ffdb/subs1.ff$ORIGIN_STATE_NM.ff"
##  ..  .. ..- attr(*, "pagesize")= int 65536
##  ..  .. ..- attr(*, "finalizer")= chr "close"
##  ..  .. ..- attr(*, "finonexit")= logi TRUE
##  ..  .. ..- attr(*, "readonly")= logi FALSE
##  ..  .. ..- attr(*, "caching")= chr "mmnoflush"
##  ..  ..- attr(*, "virtual")= list()
##  ..  .. ..- attr(*, "Length")= int 4529
##  ..  .. ..- attr(*, "Symmetric")= logi FALSE
##  ..  .. ..- attr(*, "Levels")= chr [1:52] "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  ..  .. ..- attr(*, "ramclass")= chr "factor"
##  ..  ..- attr(*, "class")= chr [1:2] "ff_vector" "ff"
##  .. $ DEST_CITY_NAME   : list()
##  ..  ..- attr(*, "physical")=Class 'ff_pointer' <externalptr> 
##  ..  .. ..- attr(*, "vmode")= chr "integer"
##  ..  .. ..- attr(*, "maxlength")= int 4529
##  ..  .. ..- attr(*, "pattern")= chr "ffdf"
##  ..  .. ..- attr(*, "filename")= chr "/Users/umatter/Dropbox/Teaching/HSG/BigData/BigData/materials/slides/ffdb/subs1.ff$DEST_CITY_NAME.ff"
##  ..  .. ..- attr(*, "pagesize")= int 65536
##  ..  .. ..- attr(*, "finalizer")= chr "close"
##  ..  .. ..- attr(*, "finonexit")= logi TRUE
##  ..  .. ..- attr(*, "readonly")= logi FALSE
##  ..  .. ..- attr(*, "caching")= chr "mmnoflush"
##  ..  ..- attr(*, "virtual")= list()
##  ..  .. ..- attr(*, "Length")= int 4529
##  ..  .. ..- attr(*, "Symmetric")= logi FALSE
##  ..  .. ..- attr(*, "Levels")= chr [1:306] "Abilene, TX" "Akron, OH" "Albany, GA" "Albany, NY" ...
##  ..  .. ..- attr(*, "ramclass")= chr "factor"
##  ..  ..- attr(*, "class")= chr [1:2] "ff_vector" "ff"
##  .. $ DEST_STATE_NM    : list()
##  ..  ..- attr(*, "physical")=Class 'ff_pointer' <externalptr> 
##  ..  .. ..- attr(*, "vmode")= chr "integer"
##  ..  .. ..- attr(*, "maxlength")= int 4529
##  ..  .. ..- attr(*, "pattern")= chr "ffdf"
##  ..  .. ..- attr(*, "filename")= chr "/Users/umatter/Dropbox/Teaching/HSG/BigData/BigData/materials/slides/ffdb/subs1.ff$DEST_STATE_NM.ff"
##  ..  .. ..- attr(*, "pagesize")= int 65536
##  ..  .. ..- attr(*, "finalizer")= chr "close"
##  ..  .. ..- attr(*, "finonexit")= logi TRUE
##  ..  .. ..- attr(*, "readonly")= logi FALSE
##  ..  .. ..- attr(*, "caching")= chr "mmnoflush"
##  ..  ..- attr(*, "virtual")= list()
##  ..  .. ..- attr(*, "Length")= int 4529
##  ..  .. ..- attr(*, "Symmetric")= logi FALSE
##  ..  .. ..- attr(*, "Levels")= chr [1:52] "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  ..  .. ..- attr(*, "ramclass")= chr "factor"
##  ..  ..- attr(*, "class")= chr [1:2] "ff_vector" "ff"
##  .. $ CANCELLATION_CODE: list()
##  ..  ..- attr(*, "physical")=Class 'ff_pointer' <externalptr> 
##  ..  .. ..- attr(*, "vmode")= chr "integer"
##  ..  .. ..- attr(*, "maxlength")= int 4529
##  ..  .. ..- attr(*, "pattern")= chr "ffdf"
##  ..  .. ..- attr(*, "filename")= chr "/Users/umatter/Dropbox/Teaching/HSG/BigData/BigData/materials/slides/ffdb/subs1.ff$CANCELLATION_CODE.ff"
##  ..  .. ..- attr(*, "pagesize")= int 65536
##  ..  .. ..- attr(*, "finalizer")= chr "close"
##  ..  .. ..- attr(*, "finonexit")= logi TRUE
##  ..  .. ..- attr(*, "readonly")= logi FALSE
##  ..  .. ..- attr(*, "caching")= chr "mmnoflush"
##  ..  ..- attr(*, "virtual")= list()
##  ..  .. ..- attr(*, "Length")= int 4529
##  ..  .. ..- attr(*, "Symmetric")= logi FALSE
##  ..  .. ..- attr(*, "Levels")= chr [1:4] "" "A" "B" "C"
##  ..  .. ..- attr(*, "ramclass")= chr "factor"
##  ..  ..- attr(*, "class")= chr [1:2] "ff_vector" "ff"
##  $ row.names:  NULL
## - attributes: List of 2
##  .. $ names: chr [1:2] "virtual" "physical"
##  .. $ class: chr "ffdf"
dim(subs1.ff)
## [1] 4529    7
dimnames(subs1.ff)
## [[1]]
## NULL
## 
## [[2]]
## [1] "FL_DATE"           "AIRLINE_ID"        "ORIGIN_CITY_NAME"  "ORIGIN_STATE_NM"  
## [5] "DEST_CITY_NAME"    "DEST_STATE_NM"     "CANCELLATION_CODE"

Export to CSV

# Export subs1.ff into a CSV file:
write.csv.ffdf(subs1.ff, "subset1.csv")
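
Apart from exporting to plain text, an ffdf (together with its on-disk `.ff` backing files) can be persisted and restored across R sessions. A sketch using `save.ffdf()`/`load.ffdf()` from the `ffbase` package (the directory name `ffdb_backup` is only an illustration):

```r
library(ffbase)

# Persist the ffdf and its .ff backing files in one directory
save.ffdf(subs1.ff, dir = "./ffdb_backup")

# In a later session: restore the object (and its bindings)
load.ffdf(dir = "./ffdb_backup")
```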

References

Walkowiak, Simon. 2016. Big Data Analytics with R. Birmingham, UK: Packt Publishing.

Wickham, Hadley. 2019. Advanced R. Second Edition. Boca Raton, FL: CRC Press.